Game Analytics: From Exploratory Data Analysis to Predictive Modeling

Author

Hoang Son Lai

Published

November 17, 2025

1. Data Overview & Processing

The data preparation stage begins by loading the raw game session CSV and converting timestamp strings into POSIX datetime objects for start_time and end_time. Missing or problematic values are handled (for example game_duration is set to 0 where missing), and several derived metrics are computed: score_per_second (score divided by duration) and accuracy (UFOs shot divided by bullets fired).

Code
# Load and clean the data
game_data <- read.csv("data/game_sessions.csv", stringsAsFactors = FALSE)

# Data cleaning and preprocessing
game_data_clean <- game_data %>%
  mutate(
    start_time = as.POSIXct(start_time, format = "%Y-%m-%dT%H:%M:%OSZ"),
    end_time = as.POSIXct(end_time, format = "%Y-%m-%dT%H:%M:%OSZ"),
    death_reason = as.factor(death_reason),
    # Handle missing end_time
    game_duration = ifelse(is.na(game_duration), 0, game_duration),
    # Create performance metrics
    score_per_second = ifelse(game_duration > 0, score / game_duration, 0),
    accuracy = ifelse(bullets_fired > 0, ufos_shot / bullets_fired, 0)
  ) %>%
  filter(!is.na(start_time))  
Code
# Display basic information
cat("Cleaned Dataset Dimensions:", dim(game_data_clean), "\n")
Cleaned Dataset Dimensions: 300 12 
Code
cat("Date Range:", as.character(min(game_data_clean$start_time)), "to", 
    as.character(max(game_data_clean$start_time)), "\n")
Date Range: 2025-11-16 12:33:32.897 to 2025-11-17 04:58:20.094 
Code
head(game_data_clean)
                   id          start_time            end_time score
1 plane_1763296412897 2025-11-16 12:33:32 2025-11-16 12:33:40     8
2 plane_1763296421212 2025-11-16 12:33:41 2025-11-16 12:33:44     6
3 plane_1763296425226 2025-11-16 12:33:45 2025-11-16 12:33:45     0
4 plane_1763296426741 2025-11-16 12:33:46 2025-11-16 12:33:47     0
5 plane_1763296427702 2025-11-16 12:33:47 2025-11-16 12:33:48     0
6 plane_1763296428948 2025-11-16 12:33:48 2025-11-16 12:33:49     0
  coins_collected ufos_shot bullets_fired death_reason game_duration
1               2         2            33         pipe             7
2               0         2            31         pipe             3
3               0         0             0       ground             0
4               0         0             0       ground             0
5               0         0             0       ground             0
6               0         0             0       ground             0
  pipes_passed score_per_second   accuracy
1            3         1.142857 0.06060606
2            1         2.000000 0.06451613
3            0         0.000000 0.00000000
4            0         0.000000 0.00000000
5            0         0.000000 0.00000000
6            0         0.000000 0.00000000
Code
variable_description <- tibble(
  Variable = c(
    "id",
    "start_time",
    "end_time",
    "score",
    "coins_collected",
    "ufos_shot",
    "bullets_fired",
    "death_reason",
    "game_duration",
    "pipes_passed",
    "score_per_second",
    "accuracy"
  ),
  Description = c(
    "Unique session identifier",
    "Timestamp when the game session started",
    "Timestamp when the game session ended",
    "Final score achieved in the session. Score = coins_collected + (3 × ufos_shot)",
    "Number of coins collected by the player",
    "Number of UFO enemies shot",
    "Total number of bullets fired",
    "Cause of death (collision type / hazard)",
    "Total session duration in seconds",
    "Number of pipes the player successfully passed",
    "Score normalized by session duration (score ÷ game_duration)",
    "Shooting accuracy (ufos_shot ÷ bullets_fired)"
  ),
  Type = c(
    "Character",
    "Datetime",
    "Datetime",
    "Integer",
    "Integer",
    "Integer",
    "Integer",
    "Categorical",
    "Numeric",
    "Integer",
    "Numeric",
    "Numeric"
  )
)

variable_description %>%
  gt() %>%
  tab_header(
    title = md("**Variable Description - Plane Game Analytics**")
  ) %>%
  cols_width(
    Variable ~ px(160),
    Description ~ px(420),
    Type ~ px(120)
  ) %>%
  tab_style(
    style = cell_text(weight = "bold"),
    locations = cells_column_labels()
  )
Table 1
Variable Description - Plane Game Analytics
Variable Description Type
id Unique session identifier Character
start_time Timestamp when the game session started Datetime
end_time Timestamp when the game session ended Datetime
score Final score achieved in the session. Score = coins_collected + (3 × ufos_shot) Integer
coins_collected Number of coins collected by the player Integer
ufos_shot Number of UFO enemies shot Integer
bullets_fired Total number of bullets fired Integer
death_reason Cause of death (collision type / hazard) Categorical
game_duration Total session duration in seconds Numeric
pipes_passed Number of pipes the player successfully passed Integer
score_per_second Score normalized by session duration (score ÷ game_duration) Numeric
accuracy Shooting accuracy (ufos_shot ÷ bullets_fired) Numeric
Code
# Check NA
na_table <- game_data_clean %>%
  summarise(across(everything(), ~ sum(is.na(.)))) %>%
  pivot_longer(cols = everything(), names_to = "Column", values_to = "NA Count") %>%
  arrange(desc(`NA Count`))

# Display by kable
kable(na_table)
Table 2: Number of NA values per column in game_data_clean
Column NA Count
id 0
start_time 0
end_time 0
score 0
coins_collected 0
ufos_shot 0
bullets_fired 0
death_reason 0
game_duration 0
pipes_passed 0
score_per_second 0
accuracy 0
Code
# Summary statistics
game_data_clean %>%
  select(score, game_duration, coins_collected, ufos_shot, bullets_fired, pipes_passed, accuracy) %>%
  summary() %>%
  kable()
Table 3: Summary statistics for key variables
score game_duration coins_collected ufos_shot bullets_fired pipes_passed accuracy
Min. : 0.00 Min. : 0.000 Min. : 0.000 Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. :0.00000
1st Qu.: 3.00 1st Qu.: 2.000 1st Qu.: 0.000 1st Qu.: 1.00 1st Qu.: 7.75 1st Qu.: 1.00 1st Qu.:0.02941
Median : 7.00 Median : 4.000 Median : 1.000 Median : 2.00 Median : 19.00 Median : 2.00 Median :0.09839
Mean : 12.59 Mean : 9.053 Mean : 3.863 Mean : 2.91 Mean : 23.19 Mean : 4.58 Mean :0.09363
3rd Qu.: 17.25 3rd Qu.:14.000 3rd Qu.: 5.000 3rd Qu.: 4.00 3rd Qu.: 33.00 3rd Qu.: 6.25 3rd Qu.:0.14733
Max. :134.00 Max. :96.000 Max. :56.000 Max. :28.00 Max. :171.00 Max. :57.00 Max. :0.28846

2. Exploratory Data Analysis

2.1. Distribution Overview

Code
# Tidy dataset long & clean variable names
game_long <- game_data_clean %>%
  select(score, game_duration, accuracy, coins_collected, ufos_shot) %>%
  pivot_longer(
    cols = everything(),
    names_to = "variable",
    values_to = "value"
  ) %>%
  mutate(
    variable = str_replace_all(variable, "_", " "),  
    variable = str_to_title(variable)                 
  )

# Facet color
facet_colors <- c(
  "Score"            = "#4e79a7",
  "Game Duration"    = "#f28e2b",
  "Accuracy"         = "#59a14f",
  "Coins Collected"  = "#e15759",
  "Ufos Shot"        = "#76b7b2"
)

# Plot
ggplot(game_long, aes(x = value, fill = variable)) +
  geom_histogram(
    bins = 30,
    alpha = 0.7,
    color = "white",
    linewidth = 0.3
  ) +
  scale_fill_manual(values = facet_colors) +
  facet_wrap(~ variable, scales = "free", ncol = 3) +
  labs(
    title = "Distribution of Game Performance Metrics",
    x = "Value",
    y = "Count",
    fill = NULL
  ) +
  theme_minimal(base_size = 12) +
  theme(
    legend.position    = "none",
    strip.text         = element_text(face = "bold", size = 9),
    plot.title         = element_text(face = "bold", size = 14, hjust = 0.5),
    panel.border       = element_rect(
      color = "white",
      fill  = NA,
      linewidth = 1
    ),
    axis.title.x       = element_text(size = 9, margin = margin(t = 5)),
    axis.title.y       = element_text(size = 9, margin = margin(r = 5)),
    axis.text.x        = element_text(size = 6),
    axis.text.y        = element_text(size = 6)
  )
Figure 1: Distribution of Game Performance Metrics
Code
# Bar chart death reason
death_reason_data <- game_data_clean %>%
  count(death_reason, sort = TRUE) %>%
  mutate(
    death_reason_clean = death_reason %>%
      str_replace_all("_", " ") %>% 
      str_to_title()
  )

highchart() %>%
  hc_chart(type = "bar") %>%
  hc_title(text = "Common Causes Of Death") %>%
  hc_xAxis(
    categories = death_reason_data$death_reason_clean,
    title = list(text = "Death Reason"),
    labels = list(style = list(fontSize = "11px"))
  ) %>%
  hc_yAxis(
    title = list(text = "Number of games"),
    labels = list(style = list(fontSize = "11px"))
  ) %>%
  hc_add_series(
    name = "Count",
    data = death_reason_data$n,
    color = "#1f77b4"
  ) %>%
  hc_tooltip(
    useHTML = TRUE,
    pointFormat = "<b>Count:</b> {point.y}"    
  ) %>%
  hc_plotOptions(
    series = list(
      borderWidth = 0,
      dataLabels = list(enabled = FALSE)
    )
  ) %>%
  hc_legend(enabled = FALSE) %>%              
  hc_exporting(enabled = TRUE)
Figure 2: Common Causes Of Death
Code
# Heatmap correlation
colors <- mako(10)
stops <- seq(0, 1, length.out = length(colors))
custom_colorscale <- lapply(seq_along(colors), function(i) {
  list(stops[i], colors[i])
})

num_vars <- game_data_clean %>%
  select(
    score, game_duration, coins_collected, ufos_shot, 
    bullets_fired, pipes_passed, accuracy
  )

cor_matrix <- cor(num_vars, use = "complete.obs")

plot_ly(
  x = colnames(cor_matrix),
  y = colnames(cor_matrix),
  z = cor_matrix,
  type = "heatmap",
  colorscale = custom_colorscale,
  xgap = 2,
  ygap = 2,
  showscale = TRUE
) %>%
  layout(
    title = list(text = "Correlation Matrix for Key Performance Metrics", x = 0.5),
    xaxis = list(title = "", tickangle = 45, tickfont = list(size = 10)),
    yaxis = list(title = "", autorange = "reversed", tickfont = list(size = 10)),
    plot_bgcolor = "white",
    paper_bgcolor = "white",
    margin = list(l = 80, r = 80, b = 80, t = 80)
  )
Figure 3: Correlation Matrix for Key Performance Metrics

Figure 3

2.2 Death Reason Deep-Dive

Code
# Survival Timeline by Death Reason
format_bin <- function(x) {
  x <- gsub("\\(", "", x)
  x <- gsub("\\]", "", x)
  x <- gsub("\\[", "", x)
  x <- gsub("\\)", "", x)
  x <- gsub(",", "-", x)
  x
}

game_data_binned <- game_data_clean %>%
  mutate(duration_bin = cut(game_duration,
                            breaks = seq(0, 140, by = 5),
                            include.lowest = TRUE)) %>%
  filter(!is.na(duration_bin)) %>%
  mutate(duration_label = format_bin(as.character(duration_bin))) %>%
  count(duration_label, death_reason, name = "count")

duration_levels <- format_bin(as.character(levels(cut(seq(0, 100, by = 5),
                                                      breaks = seq(0, 140, by = 5),
                                                      include.lowest = TRUE))))
game_data_binned$duration_label <- factor(game_data_binned$duration_label,
                                          levels = duration_levels)

timeline_plot <- ggplot(
  game_data_binned,
  aes(
    x = duration_label,
    y = count,
    color = death_reason,
    group = death_reason,
    text = paste0(
      "<b>Death Reason:</b> ", death_reason, "<br>",
      "<b>Duration:</b> ", duration_label, " sec<br>",
      "<b>Count:</b> ", count
    )
  )
) +
  geom_line(size = 0.7) +
  geom_point(size = 1.5) +
  labs(
    title = "Survival Timeline by Death Reason",
    x = "Game Duration (seconds)",
    y = "Number of Deaths",
    color = "Death Reason"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1, margin = margin(t = 5)),
    axis.text.y = element_text(margin = margin(r = 5))
  )

ggplotly(timeline_plot, tooltip = "text") %>%
  layout(
    title = list(
      text = "<b>Survival Timeline by Death Reason</b>",  
      x = 0.5,          
      xanchor = "center",
      font = list(size = 17)  
    ),
    legend = list(
      orientation = "h",
      x = 0.5,
      xanchor = "center",
      y = -0.25,
      yanchor = "top"
    ),
    xaxis = list(
      title_standoff = 20
    ),
    yaxis = list(
      title_standoff = 20
    ),
    margin = list(b = 160)  
  )
Figure 4: Survival Timeline by Death Reason
Code
#  Distribution of score by death_reason
stats <- game_data_clean %>%
  group_by(death_reason) %>%
  summarise(
    count = n(),
    mean  = mean(score),
    min   = min(score),
    q1    = quantile(score, 0.25),
    median= median(score),
    q3    = quantile(score, 0.75),
    max   = max(score)
  )

df <- left_join(game_data_clean, stats, by = "death_reason")

p <- plot_ly()

unique_reasons <- unique(df$death_reason)

for (dr in unique_reasons) {

  dsub <- df %>% filter(death_reason == dr)

  cd <- as.matrix(dsub[, c("count","mean","min","q1","median","q3","max")])

  p <- add_trace(
    p,
    data = dsub,
    x = ~death_reason,
    y = ~score,
    type = "violin",
    name = dr,
    box = list(visible = TRUE),
    meanline = list(visible = TRUE),
    customdata = cd,
    hovertemplate = paste(
      "<b>Death reason:</b> ", dr, "<br>",
      "<b>Score:</b> %{y}<br><br>",
      "<b>Count:</b> %{customdata[0]}<br>",
      "<b>Mean:</b> %{customdata[1]:.2f}<br>",
      "<b>Min:</b> %{customdata[2]}<br>",
      "<b>Q1:</b> %{customdata[3]}<br>",
      "<b>Median:</b> %{customdata[4]}<br>",
      "<b>Q3:</b> %{customdata[5]}<br>",
      "<b>Max:</b> %{customdata[6]}<extra></extra>"
    )
  )
}

 p %>% layout(
  title = "Score Distribution by Death Reason",
  xaxis = list(title = "Death Reason"),
  yaxis = list(title = "Score")
)
Figure 5: Score Distribution by Death Reason
Code
# Expected Value of Score Lost per Death Type
ev_loss <- game_data_clean %>%
    group_by(death_reason) %>%
    rename(`Death reason` = death_reason) %>%
    summarise(
        `Mean score` = mean(score),
        `Median score` = median(score),
        `Count of deaths` = n(),
        .groups = 'drop'
    ) %>%
    arrange(desc(`Mean score`))

ev_loss %>% kable()
Table 4: Expected Value of Score per Death Reason
Death reason Mean score Median score Count of deaths
ufo_collision 30.3333333 30.0 3
enemy_bullet 28.2812500 22.5 32
pipe 14.5054945 8.0 182
ceiling 10.1250000 3.0 8
ground 0.8133333 0.0 75

2.3 Behavioral Feature Engineering

Code
# Behavioral features created 
game_data_enhanced <- game_data_clean %>%
    mutate(
        aggressiveness = ifelse(game_duration > 0, bullets_fired / game_duration, 0),
        efficiency = ifelse(game_duration > 0, score / game_duration, 0),
        risk_taking = ifelse(pipes_passed > 0, ufos_shot / pipes_passed, 0),
        coin_rate = ifelse(game_duration > 0, coins_collected / game_duration, 0),
        ufo_rate = ifelse(game_duration > 0, ufos_shot / game_duration, 0),
        pipe_pass_rate = ifelse(game_duration > 0, pipes_passed / game_duration, 0),
        complexity = bullets_fired + ufos_shot + pipes_passed,
        skill_tier = cut(score, 
                        breaks = c(-1, 4, 9, 14, 29, 49, Inf),
                        labels = c("0-4", "5-9", "10-14", "15-29", "30-49", "50+"))
    )
Code
# Behavioral features
behavioral_vars <- game_data_enhanced %>%
  select(aggressiveness, efficiency, accuracy, risk_taking,
         coin_rate, ufo_rate, pipe_pass_rate, complexity)

# Correlation matrix
behavioral_cor <- cor(behavioral_vars, use = "complete.obs")

# Heatmap Plotly 
plot_ly(
  x = colnames(behavioral_cor),
  y = colnames(behavioral_cor),
  z = behavioral_cor,
  type = "heatmap",
  colorscale = "RdBu",   
  zmin = -1,
  zmax = 1,
  xgap = 2,
  ygap = 2,
  showscale = TRUE
) %>%
  layout(
    title = list(text = "Behavioral Features Correlation Matrix", x = 0.5),
    xaxis = list(title = "", tickangle = 45, tickfont = list(size = 10)),
    yaxis = list(title = "", autorange = "reversed", tickfont = list(size = 10)),
    plot_bgcolor = "white",
    paper_bgcolor = "white",
    margin = list(l = 80, r = 80, b = 80, t = 80)
  )
Figure 6: Behavioral Features Correlation Matrix

2.4. Session Progression Analysis

Code
# (A) Score vs Duration with trendline
score_duration_plot <- ggplot(game_data_enhanced, 
                            aes(x = game_duration, y = score)) +
    geom_point(alpha = 0.6, color = "#1f77b4") +
    geom_smooth(method = "loess", color = "#ff7f0e", se = TRUE) +
    labs(title = "Score vs Game Duration with Trendline",
         x = "Game Duration (seconds)",
         y = "Score") +
    theme_minimal()

ggplotly(score_duration_plot)
Figure 7: Score vs Game Duration with Trendline
Code
# (B) Bullets vs UFO Shot 
efficiency_plot <- ggplot(game_data_enhanced,
                         aes(x = bullets_fired, y = ufos_shot, color = skill_tier)) +
    geom_point(alpha = 0.7) +
    geom_smooth(method = "lm", se = FALSE) +
    labs(title = "Bullets Fired vs UFOs Shot",
         x = "Bullets Fired", 
         y = "UFOs Shot",
         color = "Skill Tier (by Score)") +
    theme_minimal()

ggplotly(efficiency_plot)
Figure 8: Bullets Fired vs UFOs Shot

2.5 Clusterable Structure Check

Code
# PCA Analysis
cluster_data <- game_data_enhanced %>%
    select(score, game_duration, coins_collected, bullets_fired,
           ufos_shot, pipes_passed, aggressiveness, efficiency,
           accuracy, risk_taking) %>%
    scale()

pca_result <- prcomp(cluster_data, scale. = TRUE)

pca_df <- as.data.frame(pca_result$x)
pca_df$death_reason <- game_data_enhanced$death_reason
pca_df$score <- game_data_enhanced$score
pca_df$game_duration <- game_data_enhanced$game_duration
pca_df$accuracy <- round(game_data_enhanced$accuracy, 3)
pca_df$coins_collected <- game_data_enhanced$coins_collected
pca_df$ufos_shot <- game_data_enhanced$ufos_shot
pca_df$pipes_passed <- game_data_enhanced$pipes_passed

pca_df$tooltip <- paste(
    "Death reason:", pca_df$death_reason, "<br>",
    "Score:", pca_df$score, "<br>",
    "Duration:", pca_df$game_duration, "s<br>",
    "Accuracy:", pca_df$accuracy, "<br>",
    "Coins:", pca_df$coins_collected, "<br>",
    "UFOs:", pca_df$ufos_shot, "<br>",
    "Pipes:", pca_df$pipes_passed
)

pca_plot <- plot_ly(pca_df, x = ~PC1, y = ~PC2, 
                   color = ~death_reason,
                   type = 'scatter', mode = 'markers',
                   text = ~tooltip,
                   hoverinfo = 'text',
                   marker = list(size = 8, opacity = 0.7)) %>%
    layout(title = "PCA - Player Behavior Patterns by Death Reason",
           xaxis = list(title = paste0("PC1 (", 
                                     round(100 * pca_result$sdev[1]^2/sum(pca_result$sdev^2), 1), "%)")),
           yaxis = list(title = paste0("PC2 (", 
                                     round(100 * pca_result$sdev[2]^2/sum(pca_result$sdev^2), 1), "%)")))

pca_plot
Figure 9: PCA - Player Behavior Patterns by Death Reason
Code
# PCA Loadings - Meaning of Principal components
pca_loadings <- as.data.frame(pca_result$rotation[, 1:2])
pca_loadings$feature <- rownames(pca_loadings)

loadings_plot <- ggplot(pca_loadings, aes(x = PC1, y = PC2, label = feature)) +
    geom_point(size = 3, color = "blue") +
    geom_text_repel(size = 4, max.overlaps = 10) +
    geom_vline(xintercept = 0, linetype = "dashed", alpha = 0.5) +
    geom_hline(yintercept = 0, linetype = "dashed", alpha = 0.5) +
    labs(title = "PCA Loadings - Meaning of Principal Components",
         x = paste0("PC1 (", round(100 * pca_result$sdev[1]^2/sum(pca_result$sdev^2), 1), "%)"),
         y = paste0("PC2 (", round(100 * pca_result$sdev[2]^2/sum(pca_result$sdev^2), 1), "%)")) +
    theme_minimal()

loadings_plot
Figure 10: PCA Loadings - Meaning of Principal Components

3. Bootstrapping Data for Machine Learning

Code
# Sort by time and split data
game_sorted <- game_data_enhanced %>% arrange(start_time)
train_base <- head(game_sorted, 250)
test_holdout <- tail(game_sorted, 50)

# Bootstrap training data
set.seed(123)
bootstrap_size <- 10000
train_bootstrapped <- train_base %>% 
    slice_sample(n = bootstrap_size, replace = TRUE) %>% 
    mutate(is_synthetic = TRUE)

cat("Original Train Size:", nrow(train_base), "\n")
Original Train Size: 250 
Code
cat("Bootstrapped Train Size:", nrow(train_bootstrapped), "\n")
Bootstrapped Train Size: 10000 
Code
cat("Holdout Test Size:", nrow(test_holdout), "\n")
Holdout Test Size: 50 
Code
# Compare distribution real vs bootstrapped
compare_plot <- ggplot() +
    geom_density(data = train_base, aes(x = score, color = "Real"), size = 1) +
    geom_density(data = train_bootstrapped, aes(x = score, color = "Bootstrapped"), size = 1, alpha = 0.7) +
    labs(title = "Score Distribution: Real vs Bootstrapped Data",
         x = "Score", y = "Density", color = "Data Type") +
    theme_minimal()

compare_plot
Figure 11: Score Distribution: Real vs Bootstrapped Data

4. Segmentation (Unsupervised Learning)

Code
# Features for clustering
cluster_features_enhanced <- train_bootstrapped %>% 
    select(score, game_duration, coins_collected, bullets_fired,
           ufos_shot, pipes_passed, aggressiveness, efficiency,
           accuracy, risk_taking)

scaled_features_enhanced <- scale(cluster_features_enhanced)

# KMeans clustering with 3 clusters
set.seed(123)
kmeans_enhanced <- kmeans(scaled_features_enhanced, centers = 3, nstart = 25)
train_bootstrapped$cluster_enhanced <- as.factor(kmeans_enhanced$cluster)

# Visualize clusters
fviz_cluster(kmeans_enhanced, data = scaled_features_enhanced,
            geom = "point", ellipse.type = "convex",
            ggtheme = theme_minimal(),
            main = "Enhanced Player Segmentation (K-Means)")
Figure 12: Enhanced Player Segmentation (K-Means)
Code
# Cluster profiles
cluster_profiles <- train_bootstrapped %>%
    group_by(cluster_enhanced) %>%
    rename(`Cluster enhanced` = cluster_enhanced) %>%
    summarise(
        Count = n(),
        `Avg Score` = mean(score),
        `Avg Duration` = mean(game_duration),
        `Avg Coins` = mean(coins_collected),
        `Avg UFOs` = mean(ufos_shot),
        `Avg Bullets` = mean(bullets_fired),
        `Avg Aggressiveness` = mean(aggressiveness),
        `Avg Efficiency` = mean(efficiency),
        `Avg Accuracy` = mean(accuracy),
        .groups = 'drop'
    )

cluster_profiles %>% kable()
Table 5: Cluster profiles
Cluster enhanced Count Avg Score Avg Duration Avg Coins Avg UFOs Avg Bullets Avg Aggressiveness Avg Efficiency Avg Accuracy
1 956 52.4592050 37.00628 19.1140167 11.1150628 70.791841 2.0757690 1.5083939 0.1591852
2 3061 0.4168572 1.40477 0.2198628 0.0656648 1.815093 0.3103436 0.0581862 0.0039203
3 5983 10.7751964 7.54588 2.6095604 2.7218787 23.927628 5.4787370 1.8245718 0.1212711

5. Predictive Modeling (Supervised Learning)

5.1 Score Regression

Code
# Define features
features_enhanced <- c("coins_collected", "ufos_shot", "bullets_fired", 
                      "game_duration", "pipes_passed", "aggressiveness",
                      "efficiency", "accuracy", "risk_taking")

# Random Forest with enhanced features
rf_enhanced <- randomForest(
    as.formula(paste("score ~", paste(features_enhanced, collapse = "+"))),
    data = train_bootstrapped,
    ntree = 100,
    importance = TRUE
)

# Predict and assess
predictions_rf_enhanced <- predict(rf_enhanced, newdata = test_holdout)
rmse_enhanced <- RMSE(predictions_rf_enhanced, test_holdout$score)
r2_enhanced <- R2(predictions_rf_enhanced, test_holdout$score)

cat("Enhanced Random Forest Performance:\n")
Enhanced Random Forest Performance:
Code
cat("RMSE:", round(rmse_enhanced, 2), "\n")
RMSE: 0.84 
Code
cat("R-Squared:", round(r2_enhanced, 4), "\n")
R-Squared: 0.9961 
Code
# Variable importance
varImpPlot(rf_enhanced, main = "Enhanced Feature Importance for Score Prediction")
Figure 13: Enhanced Feature Importance for Score Prediction

5.2 Survival Prediction

Code
threshold <- 30
train_bootstrapped <- train_bootstrapped %>%  
    mutate(survived_expert = as.factor(ifelse(game_duration > threshold, 1, 0)))

test_holdout <- test_holdout %>%  
    mutate(survived_expert = as.factor(ifelse(game_duration > threshold, 1, 0)))

Random Forest Classifier

Code
# Random Forest for Survival Prediction
rf_survival_model <- randomForest(
    survived_expert ~ bullets_fired + ufos_shot + coins_collected + 
                     aggressiveness + efficiency + accuracy + pipe_pass_rate,
    data = train_bootstrapped,
    ntree = 200,
    mtry = 3,
    importance = TRUE
)

# Predictions
rf_predictions <- predict(rf_survival_model, newdata = test_holdout)
rf_probabilities <- predict(rf_survival_model, newdata = test_holdout, type = "prob")[,2]

# Evaluation
rf_conf_matrix <- confusionMatrix(rf_predictions, test_holdout$survived_expert)
cat("Random Forest Accuracy:", round(rf_conf_matrix$overall['Accuracy'], 4), "\n")
Random Forest Accuracy: 0.98 
Code
# Feature Importance
varImpPlot(rf_survival_model, main = "Random Forest - Survival Prediction Feature Importance")
Figure 14: Random Forest - Survival Prediction Feature Importance

XGBoost Classifier

Code
# XGBoost for Survival Prediction
# Prepare data for XGBoost
xgb_features <- c("bullets_fired", "ufos_shot", "coins_collected", 
                  "aggressiveness", "efficiency", "accuracy", "pipe_pass_rate")

xgb_train <- as.matrix(train_bootstrapped[xgb_features])
xgb_test <- as.matrix(test_holdout[xgb_features])
xgb_labels_train <- as.numeric(as.character(train_bootstrapped$survived_expert))
xgb_labels_test <- as.numeric(as.character(test_holdout$survived_expert))

# XGBoost model
xgb_model <- xgboost(
    data = xgb_train,
    label = xgb_labels_train,
    nrounds = 100,
    objective = "binary:logistic",
    eval_metric = "logloss",
    max_depth = 6,
    eta = 0.1,
    verbose = 0
)

# Predictions
xgb_probabilities <- predict(xgb_model, xgb_test)
xgb_predictions <- ifelse(xgb_probabilities > 0.5, 1, 0)

# Evaluation
xgb_conf_matrix <- confusionMatrix(as.factor(xgb_predictions), as.factor(xgb_labels_test))
cat("XGBoost Accuracy:", round(xgb_conf_matrix$overall['Accuracy'], 4), "\n")
XGBoost Accuracy: 0.96 
Code
# Feature Importance
xgb_importance <- xgb.importance(feature_names = xgb_features, model = xgb_model)
xgb.plot.importance(xgb_importance, main = "XGBoost Feature Importance")
Figure 15: XGBoost Feature Importance

5.3 Death Reason Prediction (Multiclass Classification)

Code
# DEATH REASON PREDICTION 

# 1. Data Preparation
death_features <- c("score", "game_duration", "coins_collected", "ufos_shot", 
                   "bullets_fired", "pipes_passed", "aggressiveness", "accuracy")

train_data <- train_bootstrapped %>%
  filter(!is.na(death_reason)) %>%
  select(all_of(death_features), death_reason)

test_data <- test_holdout %>%
  filter(!is.na(death_reason)) %>%
  select(all_of(death_features), death_reason)

# 2. Train XGBoost Model
x_train <- as.matrix(train_data[death_features])
y_train <- as.integer(train_data$death_reason) - 1
x_test <- as.matrix(test_data[death_features])
y_test <- as.integer(test_data$death_reason) - 1

num_classes <- length(unique(y_train))

xgb_model <- xgboost(
  data = x_train,
  label = y_train,
  nrounds = 100,
  objective = "multi:softprob",
  num_class = num_classes,
  eval_metric = "mlogloss",
  max_depth = 6,
  eta = 0.1,
  verbose = 0
)

# 3. Predictions & Evaluation
predictions_prob <- predict(xgb_model, x_test, reshape = TRUE)
predicted_classes <- apply(predictions_prob, 1, which.max) - 1

accuracy <- mean(predicted_classes == y_test)
cat("Death Reason Prediction Accuracy:", round(accuracy, 4), "\n")
Death Reason Prediction Accuracy: 0.6 
Code
# 4. Detailed Analysis
class_names <- levels(train_data$death_reason)
predicted_factor <- factor(predicted_classes, levels = 0:(num_classes-1), labels = class_names)
actual_factor <- factor(y_test, levels = 0:(num_classes-1), labels = class_names)

conf_matrix <- confusionMatrix(predicted_factor, actual_factor)
print(conf_matrix)
Confusion Matrix and Statistics

               Reference
Prediction      ceiling enemy_bullet ground pipe ufo_collision
  ceiling             1            0      0    0             0
  enemy_bullet        0            4      0    2             0
  ground              0            0      4    1             0
  pipe                2           10      2   21             0
  ufo_collision       0            2      0    1             0

Overall Statistics
                                          
               Accuracy : 0.6             
                 95% CI : (0.4518, 0.7359)
    No Information Rate : 0.5             
    P-Value [Acc > NIR] : 0.1013          
                                          
                  Kappa : 0.3316          
                                          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: ceiling Class: enemy_bullet Class: ground
Sensitivity                  0.3333              0.2500        0.6667
Specificity                  1.0000              0.9412        0.9773
Pos Pred Value               1.0000              0.6667        0.8000
Neg Pred Value               0.9592              0.7273        0.9556
Prevalence                   0.0600              0.3200        0.1200
Detection Rate               0.0200              0.0800        0.0800
Detection Prevalence         0.0200              0.1200        0.1000
Balanced Accuracy            0.6667              0.5956        0.8220
                     Class: pipe Class: ufo_collision
Sensitivity               0.8400                   NA
Specificity               0.4400                 0.94
Pos Pred Value            0.6000                   NA
Neg Pred Value            0.7333                   NA
Prevalence                0.5000                 0.00
Detection Rate            0.4200                 0.00
Detection Prevalence      0.7000                 0.06
Balanced Accuracy         0.6400                   NA
Code
# 5. Feature Importance & Insights
importance <- xgb.importance(feature_names = death_features, model = xgb_model)
print("Top Predictors:")
[1] "Top Predictors:"
Code
print(importance)
           Feature        Gain       Cover  Frequency
            <char>       <num>       <num>      <num>
1:   bullets_fired 0.351783015 0.224402007 0.18239830
2:    pipes_passed 0.340037786 0.184151096 0.07737304
3:  aggressiveness 0.101117013 0.199416000 0.21882478
4:   game_duration 0.079811607 0.178475833 0.11805371
5:           score 0.066679940 0.081125352 0.18000532
6:        accuracy 0.055432656 0.112550782 0.17575113
7: coins_collected 0.003354383 0.017736060 0.03483116
8:       ufos_shot 0.001783599 0.002142871 0.01276256
Code
# Per-class accuracy
class_acc <- diag(conf_matrix$table) / colSums(conf_matrix$table)
cat("\nPer-Class Accuracy:\n")

Per-Class Accuracy:
Code
print(round(class_acc, 3))
      ceiling  enemy_bullet        ground          pipe ufo_collision 
        0.333         0.250         0.667         0.840           NaN 

6. Business Insights & Recommendations

Based on the analysis above, we derive the following actionable insights:

6.1. Difficulty Balancing:

Observation: The death_reason analysis highlights the most common obstacles (e.g., pipes vs. enemies). If ‘pipe’ collisions are disproportionately high early in the game, the initial difficulty curve may be too steep.

Recommendation: Adjust the gap size or spawn rate of the leading cause of death in the first 10 seconds of gameplay to improve retention.

6.2. Player Segmentation Strategy:

Observation: K-Means clustering identified distinct groups. (Refer to cluster table: e.g., High-duration/low-coin collectors vs. Aggressive shooters).

Recommendation: Introduce targeted rewards.

  • For ‘Survivors’ (High duration, low action): Introduce time-based achievements.

  • For ‘Shooters’ (High bullets/UFOs): Offer weapon skins or visual upgrades for combat milestones.

6.3. Predictive Engagement:

Observation: The Random Forest model shows that specific actions (like coins_collected or ufos_shot) are strong predictors of high scores.

Recommendation: Create a tutorial or “Daily Mission” focusing on these high-value actions to teach new players how to achieve higher scores effectively.

6.4. Monetization Opportunities:

Observation: Players who survive past the 30-second threshold (analyzed in the Logistic Regression) show higher engagement.

Recommendation: Trigger “Continue?” ads or special offers only after a player has demonstrated this “expert” survival trait, as they are more invested in the session than a player who dies instantly.